INSERT DESCRIPTION AND METHODS (NOELLE)
Physicochemical metrics by site
Discuss with Maggie on what should be displayed here. Also Need shape files
The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (combo_pipeline.rdata) contains the ASV-by-sample table and the ASV taxonomy table.
After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.
Sample abbreviations are as follows:
X indicates the Environment
YY indicates the site name
Z indicates the replicate number
So MCR5 is a Mat sample from Cayo Roldan, replicate 5.
Table of samples
| SamName | ENV | SITE |
|---|---|---|
| CCC1 | Coral | Coral Caye |
| CCC2 | Coral | Coral Caye |
| CCC3 | Coral | Coral Caye |
| CCC4 | Coral | Coral Caye |
| CCC5 | Coral | Coral Caye |
| CCC6 | Coral | Coral Caye |
| CCR1 | Coral | Cayo Roldan |
| CCR2 | Coral | Cayo Roldan |
| CCR3 | Coral | Cayo Roldan |
| CCR4 | Coral | Cayo Roldan |
| CCR5 | Coral | Cayo Roldan |
| CCR6 | Coral | Cayo Roldan |
| MCR1 | Mat | Cayo Roldan |
| MCR2 | Mat | Cayo Roldan |
| MCR3 | Mat | Cayo Roldan |
| MCR5 | Mat | Cayo Roldan |
| SCC1 | Sediment | Coral Caye |
| SCC2 | Sediment | Coral Caye |
| SCC3 | Sediment | Coral Caye |
| SCR1 | Sediment | Cayo Roldan |
| SCR2 | Sediment | Cayo Roldan |
| SCR3 | Sediment | Cayo Roldan |
| WCC0 | Water | Coral Caye |
| WCC1 | Water | Coral Caye |
| WCC2 | Water | Coral Caye |
| WCC3 | Water | Coral Caye |
| WCR0 | Water | Cayo Roldan |
| WCR1 | Water | Cayo Roldan |
| WCR2 | Water | Cayo Roldan |
| WCR3 | Water | Cayo Roldan |
ps) object serves as the basis for all downstream analyses.Create base phyloseq object
A. Create a phyloseq (ps) object using the silva (slv) taxonomy generated in the DADA2 workflow.
B. Rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.
[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"
C. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file.
D. Export sequence and taxonomy tables for the unadulterated phyloseq object to the DATA_TABLES directory for later use. We will use the prefix full to indicate that these are raw data products.
Remove contaminants & unwanted taxa
E. Remove any Mitochondria hits. Remember the original dataset contained 6160 ASVs.
To accomplish this we do the following:
ps object of just Mitochondria,ps object.phyloseq-class experiment-level object
otu_table() OTU Table: [ 6134 taxa and 30 samples ]
sample_data() Sample Data: [ 30 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 6134 taxa by 8 taxonomic ranks ]
Looks like this removed 26 Mitochondria ASVs. We will duplicate the code block to remove other groups. We will also get rid of any Eukaryota and Chloroplast ASVs.
F. Remove any Chloroplast hits.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5938 taxa and 30 samples ]
sample_data() Sample Data: [ 30 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 5938 taxa by 8 taxonomic ranks ]
The code removed an additional 196 ASVs classified as Chloroplast.
G. Remove any Eukaryotic hits.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5933 taxa and 30 samples ]
sample_data() Sample Data: [ 30 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 5933 taxa by 8 taxonomic ranks ]
Finally, 5 Eukaryota ASVs were eliminated from the ps object.
We will again export the sequence and taxonomy tables to the DATA_TABLES directory this time with the file prefix trim to indicate that contaminants have been removed from these data.
The raw phyloseq object.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 6160 taxa and 30 samples ]
sample_data() Sample Data: [ 30 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 6160 taxa by 8 taxonomic ranks ]
The raw phyloseq object contains:
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5933 taxa and 30 samples ]
sample_data() Sample Data: [ 30 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 5933 taxa by 8 taxonomic ranks ]
The working phyloseq object with contaminants removed(ps_slv_work_filt) contains:
The mean number of reads per sample is 12056. The minimum number of reads per sample is 154 while the maximum is 33829 reads.
Note the command to remove taxa (subset_taxa within phyloseq) removes anything that is NA for the taxonomic level of interest or above. That’s fine if you are working at the Phylum level—for example is you are only getting rid of unknown Bacteria and unknown unknowns. However if you want to run this command using Family != "Mitochondria" you will not only get rid all ASVs classified as Mitochondria but everything else that has NA for Family and above. This seems pretty insane to me. Our dataset only had 27 Mitochondria ASVs but running the command as is will remove any ASVs not classified at Family level or above. And it is not well documented that the command is doing this—I had to go digging through the files to figure out what was happening. So lets see if we can get rid of just Mitochondria ASVs.
Summary table of total reads & ASVs
At this point we can amend the sample summary table with total reads and ASVs for each sample. If you sort the table by total_reads you can see that 5 samples (all Coral) have less than 1000 reads. We should probably eliminate those samples from the analysis.
Two more things to do at this point are to create a merged phyloseq object grouped by environment AND create a phyloseq object for a subset of samples. These will come in handy later for some analyses. To make a merged ps object we collapse all samples from the same environment type together.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5933 taxa and 4 samples ]
sample_data() Sample Data: [ 4 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 5933 taxa by 8 taxonomic ranks ]
[1] "Coral" "Mat" "Sediment" "Water"
Great, still 5933 ASVs but now only 4 “samples” corresponding to the unique environment types.
Water samples only
The second is to selcet only water samples and make a separate ps object. To do this you must select the samples you wish to keep. If you want to change the group of samples, modify the script accordingly. For now this function is off meaning you must modify the eval flag in the code chunk and list sample names for this to run.
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5933 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 5933 taxa by 8 taxonomic ranks ]
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1111 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 1111 taxa by 8 taxonomic ranks ]
So for water samples only there are 1111 ASVs and 8 samples.
ps_slv: Raw phyloseq object
phyloseq-class experiment-level object
otu_table() OTU Table: [ 6160 taxa and 30 samples ]
sample_data() Sample Data: [ 30 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 6160 taxa by 8 taxonomic ranks ]
This ps object contains 6160 ASVs, 380461 total reads, and 30 samples.
ps_slv_work_filt: Contaminants removed
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5933 taxa and 30 samples ]
sample_data() Sample Data: [ 30 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 5933 taxa by 8 taxonomic ranks ]
This ps object contains 5933 ASVs, 361694 total reads, and 30 samples after removing Mitochondria, Chloroplast, and Eukaryotes.
mergedGP: Grouped by environment
phyloseq-class experiment-level object
otu_table() OTU Table: [ 5933 taxa and 4 samples ]
sample_data() Sample Data: [ 4 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 5933 taxa by 8 taxonomic ranks ]
This ps object contains 5933 ASVs, 361694 total reads, and 30 samples after collapsing by Environment (ENV).
ps_slv_water: Water samples only
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1111 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 1111 taxa by 8 taxonomic ranks ]
This ps object contains 1111 ASVs, 187164 total reads, and 8 samples after collapsing by Environment (ENV).
At this point in the workflow there are the several different phyloseq objects to chose from and, using the above methods, additional objects can easily be created.
We created two different color palettes—one for taxa and one for environment—using the same 9 colors.
My soapbox about color
Much of the subsequent taxonomic analyses will be at the Class & Family levels. We chose not to focus on the Genus level because there simply is not enough resolution in our dataset to build a cohesive story. This is because most marine systems are (microbially) understudied and we are dealing with short read data. On the other hand, Phylum level is too coarse for groups like Proteobacteria and Firmicutes. Order did not provide any additional information and can be cumbersome for taxa with poorly resolved lineages. Depending on the dataset, you may want to change your strategy.
Color blindness & graphics
Before we continue we need to talk about color. Microbial diversity is huge and it is difficult to display all of this diversity in a single, static figure. As microbial ecologists we often use colors to convey our message. Yes many of us have a decreased ability to see color or differences in color. So for our figures we are going to create a custom, color friendly palette based on Bang Wong’s scheme described in this paper. Wong’s color scheme uses contrasting colors that can be distinguished by folks with color vision deficiency—roughly 8% of people (mostly males) are color blind. Do want Keanu Reeves to understand your figures or not?
This scheme is conservative—there are only 7 colors. We added black and grey to give us a little wiggle room. Others have developed 12 and 15 color palettes, but be careful—figures with too many colors can inhibit our ability to discern patterns. Our conservative palette forces us to choose carefully when deciding which taxa to target or how many groups to display.
Here are the hex codes for the colors:
#009E73, #D55E00, #F0E442, #CC79A7, #56B4E9, #E69F00, #0072B2, #7F7F7F, #000000
Total reads & ASVs by Class
Most reads by taxa: Alphaproteobacteria, Bacteroidia, Clostridia, Deltaproteobacteria, Gammaproteobacteria, Oxyphotobacteria
Most ASVs by taxa: Alphaproteobacteria, Bacteroidia, Clostridia, Deltaproteobacteria, Gammaproteobacteria, Planctomycetacia
Class-level relative abundance
Lets take a closer look Class-level taxonomic content of these communities. There are numerous ways to do this but here we chose to collapse samples by environment and display the relative abundance of the most dominant taxa. We also generated alternative views of taxonomic composition for individual samples—a box-and-whisker plot as well as separate bar plots.
I know stacked bar charts are pretty lame but we like them for a birds eye view of the data. Here we calculate the relative abundance of taxa for each environment at the Class level. It turns out this is not too easy in phyloseq and there is a lot of (messy) code.
Since our goal is to generate a figure and we only have 9 colors, some taxa will need to be put into an Other category. We can define ‘Other’ however we like so lets take a look at the overall relative abundance of each Class.
Inspecting the table it looks like if we choose a cutoff of 0.01748 (1.748%) we get 8 taxa—sounds pretty good. The rest go into the ‘Other’ category. No matter what, we will always gloss over some groups using such a coarse approach. But as we see later some of these lower abundance groups will reappear when we look at communities at the level of individual ASVs. Next we define the Other category and draw the graph.
At a 1.5% abundance cutoff, 91 Classes are grouped into the ‘Other’ category. Great, now we can craft the bar chart. And here are the taxa that will go into the chart:
It took some tweaking to get the bar chart to look just right—so there is a lot of code here—and it could most certainly be better :/. While we’re at it we will also save a copy of the figure so we can tweak it later and make it look pretty.
One thing to notice is that the sediment samples have a large percentage of ASVs grouped into the Other category. We will see what these taxa are in the next panel.
Here we see the same data as in the last figure except now presented as individual samples. We see that the sediment samples all have a larger percentage of Other taxa. If we examine Phylum level abundance in sediment samples we see several groups that are not well represented in the rest of the dataset including
| Phylum | mean Sed samples | mean All samples |
|---|---|---|
| Proteobacteria | 0.3956 | 0.3382 |
| Bacteroidetes | 0.1646 | 0.2780 |
| Planctomycetes | 0.1228 | 0.0374 |
| Acidobacteria | 0.0884 | 0.0187 |
| Latescibacteria | 0.0260 | 0.0094 |
| Chloroflexi | 0.0220 | 0.0045 |
| Gemmatimonadetes | 0.0196 | 0.0044 |
| Spirochaetes | 0.0186 | 0.0162 |
| Kiritimatiellaeota | 0.0153 | 0.0041 |
| Thaumarchaeota | 0.0151 | 0.0032 |
| Calditrichaeota | 0.0138 | 0.0028 |
| Nanoarchaeaeota | 0.0115 | 0.0026 |
| Nitrospirae | 0.0105 | 0.0021 |
| Cyanobacteria | 0.0078 | 0.1061 |
| Euryarchaeota | 0.0060 | 0.0094 |
| Firmicutes | 0.0056 | 0.0918 |
| Crenarchaeota | 0.0052 | 0.0014 |
| Modulibacteria | 0.0049 | 0.0010 |
| Zixibacteria | 0.0040 | 0.0008 |
| BRC1 | 0.0032 | 0.0006 |
| Fibrobacteres | 0.0030 | 0.0006 |
| NA | 0.0028 | 0.0015 |
| Hydrogenedentes | 0.0027 | 0.0005 |
| Lentisphaerae | 0.0025 | 0.0008 |
| Actinobacteria | 0.0025 | 0.0075 |
| Omnitrophicaeota | 0.0023 | 0.0005 |
| Fusobacteria | 0.0019 | 0.0064 |
| Marinimicrobia_(SAR406_clade) | 0.0017 | 0.0049 |
| Verrucomicrobia | 0.0017 | 0.0026 |
| Dadabacteria | 0.0015 | 0.0008 |
| Elusimicrobia | 0.0014 | 0.0003 |
| Nitrospinae | 0.0012 | 0.0003 |
| PAUC34f | 0.0012 | 0.0002 |
| Asgardaeota | 0.0011 | 0.0002 |
| Schekmanbacteria | 0.0010 | 0.0002 |
| Entotheonellaeota | 0.0009 | 0.0002 |
| Tenericutes | 0.0007 | 0.0262 |
| AncK6 | 0.0007 | 0.0001 |
| Hydrothermarchaeota | 0.0005 | 0.0001 |
| LCP-89 | 0.0005 | 0.0001 |
| Dependentiae | 0.0004 | 0.0001 |
| Aegiribacteria | 0.0003 | 0.0001 |
| Cloacimonetes | 0.0002 | 0.0044 |
| Patescibacteria | 0.0002 | 0.0003 |
| TA06 | 0.0002 | 0.0000 |
| WS2 | 0.0001 | 0.0000 |
| Deinococcus-Thermus | 0.0001 | 0.0001 |
| Deferribacteres | 0.0001 | 0.0007 |
| Epsilonbacteraeota | 0.0001 | 0.0039 |
| Margulisbacteria | 0.0001 | 0.0000 |
| CK-2C2-2 | 0.0001 | 0.0000 |
| Altiarchaeota | 0.0000 | 0.0000 |
| WS1 | 0.0000 | 0.0000 |
| FCPU426 | 0.0000 | 0.0000 |
| Atribacteria | NA | 0.0000 |
| Synergistetes | NA | 0.0000 |
| Thermotogae | NA | 0.0006 |
| WPS-2 | NA | 0.0001 |
Alpha diversity describes the diversity in a sample or site. There are several alpha diversity metrics available in phyloseq: Observed, Chao1, ACE, Shannon, Simpson, InvSimpson, Fisher. Play around to see how different metrics change or confirm these results.
Here we want to know if diversity is significantly different across environments. In order to do that we need to know if we should run a parametric or non-parametric test, and for that we need to know if our data is normally distributed. Most of the ideas/code for alpha (and subsequent beta) diversity statistics come from this workshop tutorial by Kim Dill-McFarland and Madison Cox.
First we run the diversity estimates, add these data to our summary table, and save a copy of this table.
Shapiro-Wilk Normality Test for Shannon index.
Shapiro-Wilk normality test
data: sample_data(ps_slv_work_filt_div)$Shannon
W = 0.93, p-value = 0.05
Shapiro-Wilk Normality Test for inverse Simpson index.
Shapiro-Wilk normality test
data: sample_data(ps_slv_work_filt_div)$InvSimpson
W = 0.71, p-value = 0.000002
Shapiro-Wilk Normality Test for Chao1 richness estimator.
Shapiro-Wilk normality test
data: sample_data(ps_slv_work_filt_div)$Chao1
W = 0.89, p-value = 0.004
Shapiro-Wilk Normality Test for Observed ASV richness estimator.
Shapiro-Wilk normality test
data: sample_data(ps_slv_work_filt_div)$Observed
W = 0.89, p-value = 0.004
Next, we add the diversity estimates to our phyloseq object, and test if the data are normally distributed using Shapiro-Wilk Normality test. We will focus on the inverse Simpson and Shannon diversity estimates, the Chao richness estimate, and Observed richness but this approach can be used for any metric.
Ok, since the p-values are significant for the inverse Simpson, Chao richness, and Observed ASV richness we reject the null hypothesis that these data are normally distributed.
However, the Shannon estimates appear normally distributed. So lets see if diversity is significantly different between environment based on the Shannon index.
Normally distributed metrics
Df Sum Sq Mean Sq F value Pr(>F)
ENV 3 48.8 16.25 46 1.6e-10 ***
Residuals 26 9.2 0.35
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Shannon ~ ENV, data = sampledataDF)
$ENV
diff lwr upr p adj
Mat-Coral 2.4703 1.52853 3.4120 0.0000
Sediment-Coral 3.1813 2.36572 3.9969 0.0000
Water-Coral 0.7671 0.02254 1.5116 0.0417
Sediment-Mat 0.7110 -0.34190 1.7639 0.2727
Water-Mat -1.7032 -2.70211 -0.7043 0.0004
Water-Sediment -2.4142 -3.29518 -1.5333 0.0000
Non-normally distributed metrics
Kruskal-Wallis of inverse Simpson index.
Kruskal-Wallis rank sum test
data: InvSimpson by ENV
Kruskal-Wallis chi-squared = 20, df = 3, p-value = 0.0002
Pairwise significance test for inverse Simpson index.
Pairwise comparisons using Wilcoxon rank sum test
data: sampledataDF$InvSimpson and sampledataDF$ENV
Coral Mat Sediment
Mat 0.002 - -
Sediment 0.0006 0.011 -
Water 0.792 0.006 0.002
P value adjustment method: fdr
Kruskal-Wallis of Chao1 richness estimator.
Kruskal-Wallis rank sum test
data: Chao1 by ENV
Kruskal-Wallis chi-squared = 26, df = 3, p-value = 0.00001
Pairwise significance test for Chao1 richness estimator.
Pairwise comparisons using Wilcoxon rank sum test
data: sampledataDF$Chao1 and sampledataDF$ENV
Coral Mat Sediment
Mat 0.002 - -
Sediment 0.0003 0.067 -
Water 0.0001 0.034 0.001
P value adjustment method: fdr
Kruskal-Wallis of Observed ASV richness index.
Kruskal-Wallis rank sum test
data: Observed by ENV
Kruskal-Wallis chi-squared = 26, df = 3, p-value = 0.00001
Pairwise significance test for Observed ASV richness index.
Pairwise comparisons using Wilcoxon rank sum test
data: sampledataDF$Observed and sampledataDF$ENV
Coral Mat Sediment
Mat 0.002 - -
Sediment 0.0003 0.067 -
Water 0.0001 0.034 0.001
P value adjustment method: fdr
Since the Shannon data is normally distributed we can test for significance using ANOVA (a parametric test).
Ok, the results of the ANOVA are significant. Here we use the Tukey’s HSD (honestly significant difference) post-hoc test to determine which pairwise comparisons are different.
Looks like everything is significantly different except for Mat-Sediment communities from Cayo Roldan. Interesting.
Now we can look at the results on the inverse Simpson diversity and Chao’s richness. Since Environment is categorical, we use Kruskal-Wallis (non-parametric equivalent of ANOVA) to test for significance.
For the inverse Simpson, Chao1, and Observed richness the results of the Kruskal-Wallis rank sum test are significant. So we can look at pairwise comparisons using Wilcoxon rank sum test for post-hoc analysis.
Alpha diversity of microbial communities
Here we plot the results of each diversity index and include the Observed ASV richness. We will save a copy of the figure for later tweaking. We use the color palette described above to delineate environments.
Call:
metaMDS(comm = veganifyOTU(physeq), distance = distance)
global Multidimensional Scaling using monoMDS
Data: wisconsin(sqrt(veganifyOTU(physeq)))
Distance: bray
Dimensions: 2
Stress: 0.1584
Stress type 1, weak ties
Two convergent solutions found after 20 tries
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on 'wisconsin(sqrt(veganifyOTU(physeq)))'
Beta diversity basically tells us how similar or dissimilar samples are to one another. Phyloseq offers several ordination option using the method flag and distance metrics. Here we use non metric multidimensional scaling (NMDS) coupled with Jensen-Shannon divergence. We also save a copy of the figure for later tweaking.
We see that a convergent solution was reached around 20 iterations and our stress is below 0.20, meaning that 2-axes are sufficient to view the data. Generally, we are looking for stress values below 0.2. If the stress values are high, you may need to add more axes to the ordination. Lets visualize the plot.
So we can see some clustering within groups and spread between groups, but this is not a test for statistical differences. Do microbial communities differ significantly by environment?
First we use the adonis function in vegan to run a PERMANOVA test. This will tell us whether environments have similar centroids or not.
Call:
adonis(formula = fish.jsd ~ ENV, data = sampledf, permutations = 1000)
Permutation: free
Number of permutations: 1000
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
ENV 3 2.81 0.937 7.65 0.469 0.001 ***
Residuals 26 3.19 0.123 0.531
Total 29 6.00 1.000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These results indicate that centroids are significantly different across environments meaning that communities are different by environments.
We can also use the pairwiseAdonis package for pair-wise PERMANOVA analysis.
pairs F.Model R2 p.value p.adjusted sig
1 Coral vs Mat 4.363 0.2376 0.001 0.006 *
2 Coral vs Sediment 4.837 0.2321 0.001 0.006 *
3 Coral vs Water 8.308 0.3158 0.001 0.006 *
4 Mat vs Sediment 9.299 0.5376 0.008 0.048 .
5 Mat vs Water 19.764 0.6640 0.002 0.012 .
6 Sediment vs Water 15.542 0.5643 0.001 0.006 *
Here we see again we see that communities are different by environments.
However, PERMANOVA assumes equal beta dispersion so we will use the betadisper function from the vegan package to calculate beta dispersion values.
Homogeneity of multivariate dispersions
Call: betadisper(d = fish.jsd, group = sampledf$ENV, bias.adjust =
TRUE)
No. of Positive Eigenvalues: 28
No. of Negative Eigenvalues: 1
Average distance to median:
Coral Mat Sediment Water
0.445 0.196 0.323 0.194
Eigenvalues for PCoA axes:
(Showing 8 of 29 eigenvalues)
PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8
1.310 0.831 0.804 0.447 0.400 0.264 0.242 0.232
And then a pair-wise Permutation test for homogeneity of multivariate dispersions using permutest (again from the vegan package).
Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 1000
Response: Distances
Df Sum Sq Mean Sq F N.Perm Pr(>F)
Groups 3 0.377 0.126 12.6 1000 0.001 ***
Residuals 26 0.259 0.010
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pairwise comparisons:
(Observed p-value below diagonal, permuted p-value above diagonal)
Coral Mat Sediment Water
Coral 9.99e-04 9.99e-04 0.00
Mat 5.74e-10 2.00e-03 0.98
Sediment 1.62e-06 2.07e-04 0.12
Water 1.94e-04 9.80e-01 1.21e-01
These results are significant, meaning that environments have different dispersions. Looking at the pairwise p-values and permuted p-value, we see that the significant differences (p-value < 0.05) are between most environments.
This means we are less confident that the PERMANOVA result is a real result, and that the result is possibly due to differences in group dispersions.
We can also use Analysis of Similarity (ANOSIM)—which does not assume equal group variances—to test whether overall microbial communities differ by environment.
Call:
anosim(x = distance(ps_slv_work_filt, "jsd"), grouping = spgroup)
Dissimilarity:
ANOSIM statistic R: 0.701
Significance: 0.001
Permutation: free
Number of permutations: 999
Upper quantiles of permutations (null model):
90% 95% 97.5% 99%
0.0924 0.1265 0.1504 0.1752
Dissimilarity ranks between and within classes:
0% 25% 50% 75% 100% N
Between 52 173.75 260.5 345.25 409.5 320
Coral 28 72.25 106.0 223.75 409.5 66
Mat 13 14.25 15.5 16.75 19.0 6
Sediment 21 35.50 40.0 50.50 56.0 15
Water 1 7.75 21.0 31.50 48.0 28
And the AN0SIM result is significant meaning that environment influences microbial community composition.
To test whether microbial communities differ by environment we can use permutational analysis of variance (PERMANOVA) or analysis of similarity (ANOSIM). PERMANOVA does not assume normality but does assume equal beta dispersion between groups. We will test beta dispersion below.
---
title: "Bocas Hypoxia"
output:
flexdashboard::flex_dashboard:
storyboard: TRUE
source_code: embed
---
```{r setup, include=FALSE}
library(flexdashboard)
library(phyloseq); packageVersion("phyloseq")
library(ggplot2); packageVersion("ggplot2")
library(plotly)
library(DT)
library(kableExtra)
library(tidyverse)
library(data.table)
library(dplyr)
library("plyr"); packageVersion("plyr")
library(vegan)
library(pairwiseAdonis)
library(knitr)
```
```{r set_wd, include=FALSE}
remove(list = ls())
knitr::opts_knit$set(root.dir = getwd())
# This will setwd to wherever the .Rmd file is opened.
ptm <- proc.time()
start_time <- Sys.time()
opts_chunk$set(cache=TRUE)
#formatR::tidy_app() run this in R to tidy code. How to do it here?
```
Physicochemistry {.storyboard}
=========================================
### **Depth Profiles**: Dissolved Oxygen Depth Profiles {data-commentary-width=600}
INSERT DESCRIPTION AND METHODS (NOELLE)
```{r}
ds <- read.csv(file="DATA_TABLES/INPUT/depth_profiles.csv",
stringsAsFactors = FALSE)
```
```{r depth_profile}
ds$site_o= factor(ds$site, levels=c('Crawl_Key','Cayo_Roldon','Hospital_Point','Finca',
'Mainland', 'Pastores', 'Punta_Caracol', 'Almirante'))
p1 <- ggplot(ds, aes(DO,depth_m)) +
geom_point(shape = 16, size = 1.5, color = "darkblue") +
facet_wrap(~ site_o, ncol = 2) +
labs(#title = "Dissolved Oxygen Depth Profiles",
# subtitle = "Data plotted by site, Sept. 21, 2017",
y = "depth (m)",
x = "DO mg/L") +
geom_vline(xintercept = 2, linetype="dashed",color = "red", size=.5) +
scale_y_continuous(trans = "reverse", limits = c(11,0))
gp <- ggplotly(p1)
gp %>% layout(margin = list(b = 75))
#plotly::ggplotly(gp)%>%layout(autosize = T)
#geom_smooth(method = loess, color = "darkblue")+
#theme_bw(base_size = 10) +
# set scale to 11 m, add smoother line, and add hypoxia marker
```
***
Physicochemical metrics by site
```{r}
datatable(ds,
style="bootstrap",
class="compact",
rownames = FALSE,
width = "100%",
caption = htmltools::tags$caption(style =
"caption-side: top; text-align: left;",
"Table X: ", htmltools::em("Site Physicochemistry.")),
extensions = "Buttons",
options = list(columnDefs =
list(list(className = "dt-left", targets = 0)),
dom = "Blfrtip", buttons = c("csv", "copy"),
scrollX = TRUE, scrollCollapse = FALSE,
deferRender=TRUE, scrollY=300, scroller=TRUE,
lengthMenu = c(10, 100, 250)))
```
### **Response of coral to hypoxia**:
Discuss with Maggie on what should be displayed here. Also Need shape files
```{r hypoxia_coral_response, eval = FALSE, include = FALSE}
rm(list=ls())
library(lme4)
library(lmerTest)
library(car)
library(rcompanion)
library(MASS)
library(emmeans)
library(nlme)
###Do analysis on each response variable separately
###For each response- site as fixed factor and colony as random factor
data <- read.table("Data_Tables/INPUT/hypoxiadata.csv", header=T, sep=",")
##Make variables in dataframe accessible by name within a session
attach(data)
#head(data)
##Check the data class, categories etc.
#class(site)
#levels(site)
#class(colony)
#################################
######### PAM ###################
#################################
### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data (i.e., not transformed) and residuals
## Data are normal, variances hetergeneous (transform doesn't improve)
########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
plotNormalHistogram(yield)
########## DISTRIBUTION OF RESIDUALS #############################
### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect
lmepam<-lmer(yield~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res1 <- resid(lmepam)
#Produce a histogram of the residuals.
hist(res1,main="Yield",xlab="Residuals")
plotNormalHistogram(res1)
####Shapiro test on residuals from lme model p = 0.4312
pam.shapiro <- shapiro.test(res1) #runs a normality test on residuals
print(pam.shapiro) # null = normally distrubuted P<0.05 = non-normal
##QQ plots of residuals from lme model
qqnorm(res1,
ylab="Sample Quantiles for PAM")
qqline(res1,
col="red")
##### Levene Test
leveneTest(yield~ as.factor(site),data=data) # p = 0.00785
bartlett.test(x=yield, g=site) # p = 0.01178
#### Data normal but not homogenous- log transform does not improve
###########################################################
############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
#### Compared full model with fragment dropped
#### Best fit model (with lowest AIC) was the model with colony dropped
#### BUT using model that includes fragment because it's necessary random effect
##When comparing models with different random effects like here- use REML=FALSE. But when not comparing use REML=TRUE (default)
#full model- includes site, colony, and fragment nested in colony
#(colony/frag includes a term for colony alone and nests fragment in colony)
pam1<-lmer(yield~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude) #AIC -121.1218
ranova(pam1)
#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
pam2<-lmer(yield~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude) #AIC -123.1218
ranova(pam2)
mod.AIC<-AIC(pam1, pam2)
mod.AIC ### use the full model to account for nesting fragments
##Final models uses REML- not comparing models with different random effects
finalpam <- lmer(yield~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)
# p values for random effects
finalrandpam <- ranova(finalpam)
pam.table<-anova(finalpam, type=2) #ANOVA table
pam.table
finalrandpam
#################################
######### ZOOX ###################
#################################
### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data and residuals
## CR6B is an outlier and removed- log-transform data makes data normal and homogenous
########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
attach(data)
plotNormalHistogram(zoox)
########## DISTRIBUTION OF RESIDUALS #############################
### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect
## yield is normally distributed, but variances not homogenous
lmezoox<-lmer(zoox~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res2 <- resid(lmezoox)
#Produce a histogram of the residuals.
hist(res2,main="Zoox",xlab="Residuals")
plotNormalHistogram(res2)
####Shapiro test on residuals from lme model p = 0.004795
zoox.shapiro <- shapiro.test(res2) #runs a normality test on residuals
print(zoox.shapiro) # null = normally distrubuted (P<0.05 = non-normal
##QQ plots of residuals from lme model
qqnorm(res2,
ylab="Sample Quantiles for Zoox")
qqline(res2,
col="red")
##### Levene Test
leveneTest(zoox~ as.factor(site),data=data) # p = 0.0007486
##############################################################################################
################################### Data Transformations #####################################
##############################################################################################
###########################################################
########## Log transformation ##############################
############# Log transform improves normality- data are homegenous
### Add a new column to dataframe for log transformation
data$zoox_log = log(zoox)
### Check that log transformed data is added as a new column
head(data)
attach(data)
### Check normality of log-transformed data
plotNormalHistogram(zoox_log)
#Check normality of log-transformed residuals
lmezoox_log<-lmer(zoox_log~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res3 <- resid(lmezoox_log)
#Produce a histogram of the residuals.
hist(res3,main="zoox log",xlab="Residuals")
plotNormalHistogram(res3)
####Shapiro test on residuals from lme model = non-normal p = 0.3461
zooxlog.shapiro <- shapiro.test(res3) #runs a normality test on residuals
print(zooxlog.shapiro) # null = normally distrubuted (P<0.05 = non-normal
##QQ plots of residuals from lme model
qqnorm(res3,
ylab="Sample Quantiles for Zooxlog")
qqline(res3,
col="red")
##### Levene Test
leveneTest(zoox_log~ as.factor(site),data=data) # p = 0.7968
############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
#### Use log-transformed data
#(colony/frag includes a term for colony alone and nests fragment in colony)
zoox1<-lmer(zoox_log~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude) #AIC 72.8756
ranova(zoox1)
#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
zoox2<-lmer(zoox_log~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude) #AIC 70.8756
ranova(zoox2)
mod.AIC<-AIC(zoox1, zoox2)
mod.AIC ### use the full model to account for nesting fragments
##Final models uses REML- not comparing models with different random effects
finalzoox <- lmer(zoox_log~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)
# p values for random effects
finalrandzoox <- ranova(finalzoox)
zoox.table<-anova(finalzoox, type=2) #ANOVA table
zoox.table
finalrandzoox
#################################
######### Chla ###################
#################################
### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data and residuals
########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
plotNormalHistogram(chla)
########## DISTRIBUTION OF RESIDUALS #############################
### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect
## raw data are normal but variances not homogenous- log-transformed improves homogeneity
lmechla<-lmer(chla~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res4 <- resid(lmechla)
#Produce a histogram of the residuals.
hist(res4,main="Chla",xlab="Residuals")
plotNormalHistogram(res4)
####Shapiro test on residuals from lme model p = 0.3015
chla.shapiro <- shapiro.test(res4) #runs a normality test on residuals
print(chla.shapiro) # null = normally distrubuted (P<0.05 = non-normal
##QQ plots of residuals from lme model
qqnorm(res4,
ylab="Sample Quantiles for Zoox")
qqline(res4,
col="red")
##### Levene Test
leveneTest(chla~ as.factor(site),data=data) # p = 0.01606
#### Data are normal but variances are not homogenous
###########################################################
########## Log transformation ##############################
### Add a new column to dataframe for log transformation
data$chla_log = log(chla)
### Check that log transformed data is added as a new column
head(data)
attach(data)
### Check normality of log-transformed data
plotNormalHistogram(chla_log)
#Check normality of log-transformed residuals
lmechla_log<-lmer(chla_log~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res7 <- resid(lmechla_log)
#Produce a histogram of the residuals.
hist(res7,main="chla log",xlab="Residuals")
plotNormalHistogram(res7)
####Shapiro test on residuals from lme model = non-normal p = 0.1627
chlalog.shapiro <- shapiro.test(res7) #runs a normality test on residuals
print(chlalog.shapiro) # null = normally distrubuted (P<0.05 = non-normal
##QQ plots of residuals from lme model
qqnorm(res7,
ylab="Sample Quantiles for chlalog")
qqline(res7,
col="red")
##### Levene Test
leveneTest(chla_log~ as.factor(site),data=data) # p = 0.7487
############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
### on log transformed data
#(colony/frag includes a term for colony alone and nests fragment in colony)
chla_log1<-lmer(chla_log~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude) #AIC 55.11567
ranova(chla_log1)
#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
chla_log2<-lmer(chla_log~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude) #AIC 53.11567
ranova(chla_log2)
mod.AIC<-AIC(chla_log1, chla_log2)
mod.AIC ### use the full model to account for nesting fragments
##Final models uses REML- not comparing models with different random effects
finalchla_log <- lmer(chla_log~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)
# p values for random effects
finalchla_logrand <- ranova(finalchla_log)
chla_log.table<-anova(finalchla_log, type=2) #ANOVA table
chla_log.table
finalchla_logrand
#################################
######### chlc ###################
#################################
### First testing all data for assumptions of normality and homogeneity of variances
## Based on raw data and residuals
## Data need to be log transformed for normality and variances
########## DISTRIBUTION OF RAW DATA ##############################
###Look at distribution of data with histogram
plotNormalHistogram(chlc)
########## DISTRIBUTION OF RESIDUALS #############################
### First run the model to generate the residuals- then look at residuals to see if normally distributed
## mixed model with fragment nested in colony, site as fixed effect
## yield is normally distributed, but variances not homogenous
lmechlc<-lmer(chlc~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res5 <- resid(lmechlc)
#Produce a histogram of the residuals.
hist(res5,main="chlc",xlab="Residuals")
plotNormalHistogram(res5)
####Shapiro test on residuals from lme model p = 0.004475
chlc.shapiro <- shapiro.test(res5) #runs a normality test on residuals
print(chlc.shapiro) # null = normally distrubuted (P<0.05 = non-normal
##QQ plots of residuals from lme model
qqnorm(res5,
ylab="Sample Quantiles for Zoox")
qqline(res5,
col="red")
##### Levene Test
leveneTest(chlc~ as.factor(site),data=data) # p = 0.01316
#### Data not normally distributed or homogenous
###########################################################
########## Log transformation ##############################
############# Log transform improves normality and homogeneity
### Add a new column to dataframe for log transformation
data$chlc_log = log(chlc)
### Check that log transformed data is added as a new column
head(data)
attach(data)
### Check normality of log-transformed data
plotNormalHistogram(chlc_log)
#Check normality of log-transformed residuals
lmechlc_log<-lmer(chlc_log~ site + (1|colony), REML=FALSE, na.action=na.exclude)
res6 <- resid(lmechlc_log)
#Produce a histogram of the residuals.
hist(res6,main="chlc log",xlab="Residuals")
plotNormalHistogram(res6)
####Shapiro test on residuals from lme model = non-normal p = 0.8874
chlclog.shapiro <- shapiro.test(res6) #runs a normality test on residuals
print(chlclog.shapiro) # null = normally distrubuted (P<0.05 = non-normal
##QQ plots of residuals from lme model
qqnorm(res6,
ylab="Sample Quantiles for chlclog")
qqline(res6,
col="red")
##### Levene Test
leveneTest(chlc_log~ as.factor(site),data=data) # p = 0.884
############# MIXED EFFECTS MODELS#################################
#### Site is fixed effect, colony is random, fragment is random nested in colony
####### On log-transformed data
#(colony/frag includes a term for colony alone and nests fragment in colony)
chlc_log1<-lmer(chlc_log~site+(1|colony/frag), data=data, REML=FALSE, na.action=na.exclude) #AIC 64.52399
ranova(chlc_log1)
#frag nested in colony dropped- I think colony accounts for fragment nested inherently?
chlc_log2<-lmer(chlc_log~site+(1|colony), data=data, REML=FALSE, na.action=na.exclude) #AIC 62.52399
ranova(chlc_log2)
mod.AIC<-AIC(chlc_log1, chlc_log2)
mod.AIC ### use the full model to account for nesting fragments
##Final models uses REML- not comparing models with different random effects
finalchlc_log <- lmer(chlc_log~site+(1|colony), data=data, REML=TRUE, na.action=na.exclude)
# p values for random effects
finalchlc_logrand <- ranova(finalchlc_log)
chlc_log.table<-anova(finalchlc_log, type=2) #ANOVA table
chlc_log.table
finalchlc_logrand
```
16S rRNA Data Prep {.storyboard}
=========================================
### 1. **Define sample groups**: To prepare for downstream analysis, we first add group designations. {data-commentary-width=500}
1. The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (`combo_pipeline.rdata`) contains the ASV-by-sample table and the ASV taxonomy table.
2. After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.
```{r deliniate_sample_types}
load("../01_DADA2_WORKFLOW/combo_pipeline.rdata")
samples.out <- rownames(seqtab)
subject <- sapply(strsplit(samples.out, "[[:digit:]]"), `[`, 1)
# this splits the string at first instance of a digit
sample_name <- substr(samples.out, 1, 999) # use the whole string for individuals
environ <- substr(samples.out, 0, 1) # use the first two letters for genus
site <- substr(samples.out, 2, 3) # use the next three letters for species
num_samp <- length(unique(sample_name))
num_environ <- length(unique(environ))
num_sites <- length(unique(site))
```
3. And finally we define a sample data frame that holds the different groups we extracted from the sample names. On the right are the sample names and the different group categories.
> Sample abbreviations are as follows:
**X** indicates the Environment
* W = Water
* S = Sediment
* C = Coral
* M = Mat
**YY** indicates the site name
* CC = Coral Caye
* CR = Cayo Roldan
**Z** indicates the replicate number
So `MCR5` is a Mat sample from Cayo Roldan, replicate 5.
***
Table of samples
```{r define_variables}
#define a sample data frame
samdf <- data.frame(SamName = sample_name, ENV = environ, SITE = site)
samdf$SITE <- gsub('CC', 'Coral Caye', samdf$SITE)
samdf$SITE <- gsub('CR', 'Cayo Roldan', samdf$SITE)
samdf$ENV <- gsub('W', 'Water', samdf$ENV)
samdf$ENV <- gsub('S', 'Sediment', samdf$ENV)
samdf$ENV <- gsub('C', 'Coral', samdf$ENV)
samdf$ENV <- gsub('M', 'Mat', samdf$ENV)
rownames(samdf) <- samples.out
kable(samdf, row.names = FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
column_spec(1:3, width = "3.5cm")
```
### 2. **Create & modify a phyloseq object**: a phyloseq (`ps`) object serves as the basis for all downstream analyses.{data-commentary-width=650}
Create base phyloseq object
**A**. Create a `phyloseq` (ps) object using the silva (slv) taxonomy generated in the DADA2 workflow.
**B**. Rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention---ASV1, ASV2, ASV3, and so on.
```{r create_ps_object}
ps_slv <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE), sample_data(samdf),
tax_table(tax_silva)) # this create the phyloseq object
tax_table(ps_slv) <- cbind(tax_table(ps_slv), rownames(tax_table(ps_slv)))
taxa_names(ps_slv) <- paste0("ASV", seq(ntaxa(ps_slv))) # adding unique ASV names
tax_table(ps_slv) <- cbind(tax_table(ps_slv), rownames(tax_table(ps_slv)))
head(taxa_names(ps_slv))
```
**C**. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file.
```{r add_ASV_coulmn}
colnames(tax_table(ps_slv)) <- c("Kingdom", "Phylum", "Class", "Order",
"Family", "Genus", "ASV_SEQ", "ASV_ID")
```
**D**. Export sequence and taxonomy tables for the unadulterated phyloseq object to the `DATA_TABLES` directory for later use. We will use the prefix `full` to indicate that these are raw data products.
```{r export_seq_tax_tables}
write.table(tax_table(ps_slv), "DATA_TABLES/OUTPUT/full_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv)), "DATA_TABLES/OUTPUT/full_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```
Remove contaminants & unwanted taxa
**E**. Remove any Mitochondria hits. Remember the original dataset contained `r ntaxa(ps_slv)` ASVs.
To accomplish this we do the following:
* Subset the taxa to generate a `ps` object of just Mitochondria,
* Select the ASV column only, turn it into a factor, and use this to remove Mitochondria from the `ps` object.
```{r remove_specific_taxa}
# generate a file with mitochondria ASVs
MT1 <- subset_taxa(ps_slv, Family == "Mitochondria")
MT1 <- as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps_slv), MT1df)
ps_slv_work_no_mito <- prune_taxa(goodTaxa, ps_slv)
ps_slv_work_no_mito
```
Looks like this removed **`r ntaxa(ps_slv) - ntaxa(ps_slv_work_no_mito)` Mitochondria ASVs**. We will duplicate the code block to remove other groups. We will also get rid of any Eukaryota and Chloroplast ASVs.
**F**. Remove any Chloroplast hits.
```{r remove_specific_taxa2}
# generate a file with mitochondria ASVs
CH1 <- subset_taxa(ps_slv_work_no_mito, Order == "Chloroplast")
CH1 <- as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_slv_work_no_mito), CH1df)
ps_slv_work_no_chloro <- prune_taxa(goodTaxa, ps_slv_work_no_mito)
ps_slv_work_no_chloro
```
The code removed an additional **`r ntaxa(ps_slv_work_no_mito) - ntaxa(ps_slv_work_no_chloro)` ASVs** classified as Chloroplast.
**G**. Remove any Eukaryotic hits.
```{r remove_specific_taxa3}
# generate a file with mitochondria ASVs
EU1 <- subset_taxa(ps_slv_work_no_chloro, Kingdom == "Eukaryota")
EU1 <- as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_slv_work_no_chloro), EU1df)
ps_slv_work_filt <- prune_taxa(goodTaxa, ps_slv_work_no_chloro)
ps_slv_work_filt
```
Finally, **`r ntaxa(ps_slv_work_no_chloro) - ntaxa(ps_slv_work_filt)` Eukaryota ASVs** were eliminated from the `ps` object.
We will again export the sequence and taxonomy tables to the `DATA_TABLES` directory this time with the file prefix `trim` to indicate that contaminants have been removed from these data.
```{r export_seq_tax_tables_2}
write.table(tax_table(ps_slv_work_filt), "DATA_TABLES/OUTPUT/trim_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv_work_filt)), "DATA_TABLES/OUTPUT/trim_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```
***
The raw phyloseq object.
```{r, results = 'markdown' }
ps_slv
```
The raw phyloseq object contains:
* `r sum(otu_table(ps_slv))` total reads
* across `r ntaxa(ps_slv)` ASVs
* and `r nsamples(ps_slv)` samples
The working phyloseq object.
```{r, results = 'markdown' }
ps_slv_work_filt
```
The working phyloseq object with contaminants removed(`ps_slv_work_filt`) contains:
* `r sum(otu_table(ps_slv_work_filt))` total reads
* across `r ntaxa(ps_slv_work_filt)` ASVs
* and `r nsamples(ps_slv_work_filt)` samples
```{r}
mean_reads <- as.integer(mean(sample_sums(ps_slv_work_filt)))
min_reads <- as.integer(min(sample_sums(ps_slv_work_filt)))
max_reads <- as.integer(max(sample_sums(ps_slv_work_filt)))
```
The mean number of reads per sample is **`r mean_reads`**. The minimum number of reads per sample is **`r min_reads`** while the maximum is **`r max_reads`** reads.
***
**Note** the command to remove taxa (`subset_taxa` within phyloseq) removes anything that is `NA` for the taxonomic level of interest or above. That's fine if you are working at the Phylum level---for example is you are only getting rid of unknown Bacteria and unknown unknowns. However if you want to run this command using `Family != "Mitochondria"` you will not only get rid all ASVs classified as Mitochondria but everything else that has `NA` for Family and above. This seems pretty insane to me. Our dataset only had 27 Mitochondria ASVs but running the command *as is* will remove any ASVs not classified at Family level or above. And it is not well documented that the command is doing this---I had to go digging through the files to figure out what was happening. So lets see if we can get rid of **just** Mitochondria ASVs.
### 3. **Sample summary**: add total reads and total ASVs to sample summary table{data-commentary-width=500}
Summary table of total reads & ASVs
```{r sample_summary_table}
total_reads <- sample_sums(ps_slv_work_filt)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")
total_asvs <- estimate_richness(ps_slv_work_filt, measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")
sam_details <- samdf
rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "ENV", "Site")
merge_tab <- merge(sam_details, total_reads, by = "Sample_ID")
merge_tab2 <- merge(merge_tab, total_asvs, by = "Sample_ID")
colnames(merge_tab2) <- c("Sample_ID", "Environment", "Site",
"total_reads", "total_ASVs")
```
```{r}
datatable(merge_tab2,
style="bootstrap",
class="compact",
rownames = FALSE,
width = "100%",
caption = htmltools::tags$caption(style =
"caption-side: top; text-align: left;",
"Table X: ", htmltools::em("Total reads & ASVs by sample.")),
extensions = "Buttons",
options = list(columnDefs =
list(list(className = "dt-left", targets = 0)),
dom = "Blfrtip", buttons = c("csv", "copy"),
scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE,
lengthMenu = c(10, 30)))
```
***
At this point we can amend the sample summary table with total reads and ASVs for each sample. If you sort the table by total_reads you can see that 5 samples (all Coral) have less than 1000 reads. We should probably eliminate those samples from the analysis.
***
Two more things to do at this point are to create a merged phyloseq object grouped by environment AND create a phyloseq object for a subset of samples. These will come in handy later for some analyses. To make a merged `ps` object we collapse all samples from the same environment type together.
```{r merge, warning=FALSE}
mergedGP <- merge_samples(ps_slv_work_filt, "ENV")
SD <- merge_samples(sample_data(ps_slv_work_filt), "ENV")
mergedGP
sample_names(mergedGP)
```
Great, still `r ntaxa(mergedGP)` ASVs but now only `r nsamples(mergedGP)` "samples" corresponding to the unique environment types.
**Water samples only**
The second is to selcet only water samples and make a separate `ps` object. To do this you must select the samples you *wish to keep*. If you want to change the group of samples, modify the script accordingly. For now this function is *off* meaning you must modify the `eval` flag in the code chunk and list sample names for this to run.
```{r select_water_samples}
ps_slv_sub <- prune_samples(c("WCC0", "WCC1", "WCC2", "WCC3", "WCR0", "WCR1", "WCR2", "WCR3"), ps_slv_work_filt)
ps_slv_sub
```
6. If you remove samples you probably lost some ASVs. So you need to get rid of any ASVs that have a total of **0 reads**.
```{r remove_ASV_with_zeros_reads}
ps_slv_water <- prune_taxa(taxa_sums(ps_slv_sub) > 0, ps_slv_sub)
ps_slv_water
write.table(tax_table(ps_slv_water), "DATA_TABLES/OUTPUT/water_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv_water)), "DATA_TABLES/OUTPUT/water_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```
So for water samples only there are `r ntaxa(ps_slv_water)` ASVs and `r nsamples(ps_slv_water)` samples.
### 4. **Summary of Data Preparation**: add total reads and total ASVs to sample summary table{data-commentary-width=500}
> `ps_slv`: Raw phyloseq object
```{r}
ps_slv
```
This `ps` object contains `r ntaxa(ps_slv)` ASVs, `r sum(otu_table(ps_slv))` total reads, and `r nsamples(ps_slv)` samples.
> `ps_slv_work_filt`: Contaminants removed
```{r}
ps_slv_work_filt
```
This `ps` object contains `r ntaxa(ps_slv_work_filt)` ASVs, `r sum(otu_table(ps_slv_work_filt))` total reads, and `r nsamples(ps_slv_work_filt)` samples after removing Mitochondria, Chloroplast, and Eukaryotes.
> `mergedGP`: Grouped by environment
```{r}
mergedGP
```
This `ps` object contains `r ntaxa(ps_slv_work_filt)` ASVs, `r sum(otu_table(ps_slv_work_filt))` total reads, and `r nsamples(ps_slv_work_filt)` samples after collapsing by Environment (ENV).
> `ps_slv_water`: Water samples only
```{r}
ps_slv_water
```
This `ps` object contains `r ntaxa(ps_slv_water)` ASVs, `r sum(otu_table(ps_slv_water))` total reads, and `r nsamples(ps_slv_water)` samples after collapsing by Environment (ENV).
***
At this point in the workflow there are the several different phyloseq objects to chose from and, using the above methods, additional objects can easily be created.
Diversity {.storyboard}
=========================================
### **Overview** In this section we will tackle diversity by looking taxonomic content as well as alpha and beta diversity metrics. {data-commentary-width=600}
We created two different color palettes---one for taxa and one for environment---using the same 9 colors.
```{r define_color_blind_scheme}
friend_pal <- c("#009E73", "#D55E00", "#F0E442", "#CC79A7", "#56B4E9",
"#E69F00", "#0072B2", "#7F7F7F", "#000000")
samp_pal <- c("#CC79A7", "#E69F00", "#009E73", "#56B4E9")
plot(x=1:9, y=rep(5,9), pch=19, cex=7, col=friend_pal,
col.axis = 'white', col.lab = 'white')
```
***
My soapbox about color
Much of the subsequent taxonomic analyses will be at the **Class** & **Family** levels. We chose not to focus on the Genus level because there simply is not enough resolution in our dataset to build a cohesive story. This is because most marine systems are (microbially) understudied and we are dealing with short read data. On the other hand, Phylum level is too coarse for groups like Proteobacteria and Firmicutes. Order did not provide any additional information and can be cumbersome for taxa with poorly resolved lineages. Depending on the dataset, you may want to change your strategy.
Color blindness & graphics
Before we continue we need to talk about color. Microbial diversity is huge and it is difficult to display all of this diversity in a single, static figure. As microbial ecologists we often use colors to convey our message. Yes many of us have a decreased ability to see color or differences in color. So for our figures we are going to create a custom, color friendly palette based on Bang Wong's scheme described in this [paper](http://dx.doi.org/10.1038/nmeth.1618){target="_blank"}. Wong's color scheme uses contrasting colors that can be distinguished by folks with color vision deficiency---roughly 8% of people (mostly males) are color blind. Do want Keanu Reeves to understand your figures or not?
This scheme is conservative---there are only 7 colors. We added black and grey to give us a little wiggle room. Others have developed [12 and 15 color palettes](http://mkweb.bcgsc.ca/colorblind/){target="_blank"}, but be careful---figures with too many colors can inhibit our ability to discern patterns. Our conservative palette forces us to choose carefully when deciding which taxa to target or how many groups to display.
Here are the hex codes for the colors:
`r friend_pal`
### **Total reads & ASVs by Taxa**: What are the dominant taxa in this system? Here we take a Class-level look at overall diversity. {data-commentary-width=500}
Total reads & ASVs by Class
```{r diversity_table}
# generate the ASV table
tax_asv <- table(tax_table(ps_slv_work_filt)[, "Class"], exclude = NULL,
dnn = "Taxa")
tax_asv <- as.data.frame(tax_asv, make.names = TRUE)
# generate the reads table
tax_reads <- factor(tax_table(ps_slv_work_filt)[, "Class"])
tax_reads <- apply(otu_table(ps_slv_work_filt), MARGIN = 1, function(x)
{
tapply(x, INDEX = tax_reads, FUN = sum, na.rm = TRUE, simplify = TRUE)
})
tax_reads <- as.data.frame(tax_reads, make.names = TRUE)
tax_reads <- cbind(tax_reads, reads = rowSums(tax_reads))
tax_reads <- tax_reads[31]
tax_reads <- setDT(tax_reads, keep.rownames = TRUE)[]
# merge the two tables and make everything look pretty
# in an interactive table
taxa_read_asv_tab <- merge(tax_reads, tax_asv, by.x = "rn", by.y = "Taxa")
top_reads <- top_n(taxa_read_asv_tab, n = 6, wt = reads)
top_asvs <- top_n(taxa_read_asv_tab, n = 6, wt = Freq)
names(taxa_read_asv_tab) <- c("Taxa", "total reads", "total ASVs")
write.table(taxa_read_asv_tab, "DATA_TABLES/OUTPUT/Table_X_reads_ASVs_by_sample.txt",
sep = "\t", row.names = FALSE, quote = FALSE)
datatable(taxa_read_asv_tab,
rownames = FALSE,
width = "100%",
colnames = c("Taxa", "total reads", "total ASVs"),
caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;",
"Table X: ", htmltools::em("Total reads & ASVs by Class")),
extensions = "Buttons",
options = list(columnDefs = list(list(className = "dt-left", targets = 0)),
dom = "Blfrtip", buttons = c("csv", "copy"),
scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE,
lengthMenu = c(10, 50, 100)))
top_reads2 <- top_reads[,-1]
rownames(top_reads2) <- top_reads[,1]
top_asvs2 <- top_asvs[,-1]
rownames(top_asvs2) <- top_asvs[,1]
```
***
>**Most reads by taxa**: `r row.names(top_reads2)`
>**Most ASVs by taxa**: `r row.names(top_asvs2)`
### **Relative read abundance**: In this section we calculate relative percent read abundance for the combined dataset. This allows us to collapse rare taxa into an **Other** category. {data-commentary-width=500}
Class-level relative abundance
```{r calc_rel_abund_and_merge}
# calculate the averages and merge by species
ps_slv_filt_AVG <- transform_sample_counts(ps_slv_work_filt, function(x) x/sum(x))
mergedGP_BAR <- merge_samples(ps_slv_filt_AVG, "ENV")
SD_BAR <- merge_samples(sample_data(ps_slv_filt_AVG), "ENV")
# merge taxa by rank. If you choose a different rank be sure to change
# the rank throughout this code chunk
mdata_phy <- tax_glom(mergedGP_BAR, taxrank = "Class", NArm = FALSE)
mdata_phyrel <- transform_sample_counts(mdata_phy, function(x) x/sum(x))
meltd <- psmelt(mdata_phyrel)
meltd$Class <- as.character(meltd$Class)
# calculate the total relative abundance for all taxa
means <- ddply(meltd, ~Class, function(x) c(mean = mean(x$Abundance)))
means$mean <- round(means$mean, digits = 5)
taxa_means <- means[order(-means$mean), ] # this order in decending fashion
taxa_means <- format(taxa_means, scientific = FALSE) # ditch the sci notation
top_perc <- top_n(taxa_means, n = 8, wt = mean)
top_perc$mean <- round(as.numeric(top_perc$mean), digits = 5)
min_top_perc <- round(as.numeric(min(top_perc$mean)), digits = 5)
```
```{r rel_abund_table}
datatable(taxa_means,
rownames = FALSE,
width = "100%",
colnames = c("Class", "mean"),
caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;",
"Table X: ", htmltools::em("Class-level relative abundance.")),
extensions = "Buttons",
options = list(columnDefs = list(list(className = "dt-left", targets = "_all")),
dom = "Blfrtip", buttons = c("csv", "copy"),
scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE,
lengthMenu = c(10, 50, 100)))
write.table(taxa_means, "DATA_TABLES/OUTPUT/Table_X_Taxa_Means.txt",
sep = "\t", row.names = FALSE, quote = FALSE)
```
***
Lets take a closer look Class-level taxonomic content of these communities. There are numerous ways to do this but here we chose to collapse samples by environment and display the relative abundance of the most dominant taxa. We also generated alternative views of taxonomic composition for individual samples---a [box-and-whisker plot](#box-and-whisker-plot) as well as [separate bar plots](#separate-bar-plots).
I know stacked bar charts are pretty lame but we like them for a birds eye view of the data. Here we calculate the relative abundance of taxa for each environment at the **Class** level. It turns out this is not too easy in phyloseq and there is a lot of (messy) code.
Since our goal is to generate a figure and we only have 9 colors, some taxa will need to be put into an **Other** category. We can define 'Other' however we like so lets take a look at the overall relative abundance of each Class.
Inspecting the table it looks like if we choose a **cutoff of `r min_top_perc`** (`r min_top_perc*100`%) we get 8 taxa---sounds pretty good. The rest go into the 'Other' category. No matter what, we will always gloss over some groups using such a coarse approach. But as we see later some of these lower abundance groups will reappear when we look at communities at the level of individual ASVs. Next we define the **Other** category and draw the graph.
### **Class abundance across sample type**: Here we generate a bar chart to look at taxonomic abundance for each of the main environment types. {data-commentary-width=300}
```{r define_other}
# Here we conglomerate at 2%.
Other <- means[means$mean < min_top_perc, ]$Class
# or you can chose specifc taxa like this
#Other_manual <- c("list", "taxa", "in", "this", "format")
```
```{r metld_bar}
meltd[meltd$Class %in% Other, ]$Class <- "Other"
samp_names <- aggregate(meltd$Abundance, by = list(meltd$Sample), FUN = sum)[, 1]
.e <- environment()
meltd[, "Class"] <- factor(meltd[, "Class"], sort(unique(meltd[, "Class"])))
meltd <- meltd[order(meltd[, "Class"]), ]
# Here we order Classes by the Phylum they belong to.
#(meltd$Class)
meltd$Class <- factor(meltd$Class, levels = c("Bacteroidia", "Alphaproteobacteria",
"Gammaproteobacteria", "Oxyphotobacteria",
"Deltaproteobacteria", "Clostridia",
"Spirochaetia", "Mollicutes", "Other"))
```
```{r plot_bar_fig2A}
fig2A <- ggplot(meltd,
aes_string(x = "Sample", y = "Abundance", fill = "Class"),
environment = .e,
ordered = TRUE,
xlab = "x-axis label",
ylab = "y-axis label")
fig2A <- fig2A + geom_bar(stat = "identity",
position = position_stack(reverse = TRUE),
width = 0.95) + coord_flip() +
theme(aspect.ratio = 1/2)
fig2A <- fig2A + scale_fill_manual(values = friend_pal)
fig2A <- fig2A + theme(axis.text.x = element_text(angle = 0, hjust = 0.45, vjust = 1))
fig2A <- fig2A + guides(fill = guide_legend(override.aes = list(colour = NULL),
reverse = FALSE)) +
theme(legend.key = element_rect(colour = "black"))
fig2A <- fig2A + labs(x = "Environment", y = "Relative abundance (% total reads)")
fig2A <- fig2A + theme(axis.line = element_line(colour = "black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, size = 1))
#fig2A
pdf("FIGURES/OUTPUT/Figure_X_taxa_by_environment.pdf")
fig2A
#invisible(dev.off())
fp <- ggplotly(fig2A, height = 300)
fp %>% layout(margin = list(b = 100))
```
***
At a 1.5% abundance cutoff, `r length(Other)` Classes are grouped into the 'Other' category. Great, now we can craft the bar chart. And here are the taxa that will go into the chart:
It took some tweaking to get the bar chart to look just right---so there is a lot of code here---and it could most certainly be better :/. While we're at it we will also save a copy of the figure so we can tweak it later and make it look pretty.
One thing to notice is that the sediment samples have a large percentage of ASVs grouped into the **Other** category. We will see what these taxa are in the next panel.
### **Class abundance across sample type**: Here we generate a modified bar chart separated by individual samples. {data-commentary-width=400}
```{r supp_fig1_calc, cache = TRUE}
mdata_phy_all <- tax_glom(ps_slv_filt_AVG, taxrank = "Class", NArm = FALSE)
# You can choose any taxonomic level here
mdata_phyrel_all <- transform_sample_counts(mdata_phy_all, function(x) x/sum(x))
meltd_all <- psmelt(mdata_phyrel_all)
meltd_all$Class <- as.character(meltd_all$Class)
means_all <- ddply(meltd_all, ~Class, function(x) c(mean = mean(x$Abundance)))
means_all$mean <- round(as.numeric(means_all$mean), digits = 5)
# decending order
taxa_means_all <- means_all[order(-means_all$mean), ]
# ditch the sci notation
taxa_means_all <- format(taxa_means_all, scientific = FALSE)
top_perc_all <- top_n(taxa_means_all, n = 8, wt = mean)
top_perc_all$mean <- round(as.numeric(top_perc_all$mean), digits = 5)
min_top_perc_all <- round(as.numeric(min(top_perc_all$mean)), digits = 5)
Other <- means_all[means_all$mean < min_top_perc_all, ]$Class # Here we conglomerate at 2%.
meltd_all[meltd_all$Class %in% Other, ]$Class <- "Other"
samp_names <- aggregate(meltd_all$Abundance, by = list(meltd_all$Sample),
FUN = sum)[, 1]
.e <- environment()
meltd_all[, "Class"] <- factor(meltd_all[, "Class"], sort(unique(meltd_all[,
"Class"])))
meltd_all <- meltd_all[order(meltd_all[, "Class"]), ]
#levels(meltd_all$Class)
meltd_all$Class <- factor(meltd_all$Class, levels = c(
"Bacteroidia", "Alphaproteobacteria",
"Gammaproteobacteria", "Oxyphotobacteria",
"Clostridia", "Deltaproteobacteria",
"Spirochaetia", "Mollicutes",
"Other")) # Here we order Classes by the Phylum they belong to.
```
```{r supp_fig1, include=FALSE, echo=FALSE, eval=FALSE}
sup_fig1 <- qplot(data = meltd_all, x = ENV, y = Abundance,
fill = Class, geom = "boxplot", ylab = "Relative Abundance") +
theme(legend.position = "none") +
facet_grid(Class ~ ., scales = "free_y", space = "free_y") +
geom_jitter(width = 0.05) +
geom_point(colour = "black", fill = "white")
#+ guides(guide_legend(reverse = FALSE) )
sup_fig1 <- sup_fig1 +
scale_fill_manual(values = friend_pal) +
labs(x = "Environment", y = "Relative abundance (% total reads)")
qp <- ggplotly(sup_fig1, height = 300)
qp %>% layout(margin = list(b = 100))
#sup_fig1
pdf("FIGURES/OUTPUT/Figure_X_BAR.pdf")
sup_fig1
#invisible(dev.off())
```
```{r separate_bars_stacked}
options(scipen = 3)
options(digits = 4)
sup_fig2 <- ggplot(meltd_all,
aes_string(x = "Sample", y = "Abundance",
fill = "Class"), environment = .e, Ordered = TRUE)
sup_fig2 <- sup_fig2 +
geom_bar(stat = "identity", position = "stack") +
facet_grid(Class ~ ENV, scales = "free", space = "free_y")
sup_fig2 <- sup_fig2 +
scale_fill_manual(values = friend_pal)
# sup_fig2 <- sup_fig2 + theme(axis.text.x = element_text(angle = -90,
# hjust = 0))
sup_fig2 <- sup_fig2 + theme(axis.text.x = element_blank())
sup_fig2 <- sup_fig2 +
guides(fill = guide_legend(override.aes = list(colour = NULL),
reverse = FALSE)) +
theme(legend.key = element_rect(colour = "black"),
legend.position = "none") +
labs(x = "Individual samples",
y = "Relative abundance (% total reads)")
zp <- ggplotly(sup_fig2, height = 600)
zp %>% layout(margin = list(b = 100))
#plotly::ggplotly(sup_fig2, tooltip = c('Sample', 'Abundance'),
#originalData = TRUE, width = 600, height = 800)
#sup_fig2
pdf("FIGURES/OUTPUT/Figure_X_separate_bar.pdf")
sup_fig2
#invisible(dev.off())
```
***
Here we see the same data as in the last figure except now presented as individual samples. We see that the sediment samples all have a larger percentage of *Other* taxa. If we examine Phylum level abundance in sediment samples we see several groups that are not well represented in the rest of the dataset including
```{r sed_calc}
ps_slv_filt_AVG_sed <- prune_samples(c("SCC1", "SCC2", "SCC3", "SCR1", "SCR2", "SCR3"), ps_slv_filt_AVG)
ps_slv_filt_AVG_sed <- prune_taxa(taxa_sums(ps_slv_filt_AVG_sed) > 0, ps_slv_filt_AVG_sed)
mdata_phy_sed <- tax_glom(ps_slv_filt_AVG_sed, taxrank = "Phylum", NArm = FALSE)
mdata_phyrel_sed <- transform_sample_counts(mdata_phy_sed, function(x) x/sum(x))
meltd_sed <- psmelt(mdata_phyrel_sed)
meltd_sed$Class <- as.character(meltd_sed$Phylum)
means_sed <- ddply(meltd_sed, ~Phylum, function(x) c(mean = mean(x$Abundance)))
means_sed$mean <- round(as.numeric(means_sed$mean), digits = 5)
mdata_phy_comp <- tax_glom(ps_slv_filt_AVG, taxrank = "Phylum", NArm = FALSE)
mdata_phyrel_comp <- transform_sample_counts(mdata_phy_comp, function(x) x/sum(x))
meltd_comp <- psmelt(mdata_phyrel_comp)
meltd_comp$Class <- as.character(meltd_comp$Phylum)
means_comp <- ddply(meltd_comp, ~Phylum, function(x) c(mean = mean(x$Abundance)))
means_comp$mean <- round(as.numeric(means_comp$mean), digits = 5)
merge_sed_comp <- merge(means_sed, means_comp, by = "Phylum", all = TRUE)
merge_sed_comp <- merge_sed_comp[order(-merge_sed_comp$mean.x), ]
colnames(merge_sed_comp) <- c("Phylum", "mean Sed samples", "mean All samples")
kable(merge_sed_comp, row.names = FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
column_spec(1:3, width = "2.5cm")
```
### **$\alpha$-diversity estimates** Here we use several diversity indices and add the results to the summary table.
```{r gen_summary_table, warning = FALSE}
diversity <- estimate_richness(ps_slv_work_filt,
measures=c(
"Observed", "Chao1", "ACE",
"Shannon", "Simpson", "InvSimpson",
"Fisher"))
diversity_calc <- diversity %>% rownames_to_column("Sample_ID")
# round values
diversity_calc[c(3,5,10)] <- round(diversity_calc[c(3,5,10)], 1)
diversity_calc[c(4,6,7,9)] <- round(diversity_calc[c(4,6,7,9)], 2)
diversity_calc[8] <- round(diversity_calc[8], 3)
#host_summary <- merge(host_details, diversity_calc)
#host_summary$Observed <- NULL
merge_tab2$total_ASVs <- NULL
sample_table <- merge(merge_tab2, diversity_calc, by = "Sample_ID")
datatable(sample_table,
rownames = FALSE,
width = "100%",
colnames = c("Sample_ID", "Environment", "Site", "total reads",
"No. ASVs", "Chao1", "Chao1 (SE)",
"ACE", "ACE (SE)", "Shannon",
"Simpson", "InvSimpson", "Fisher"),
caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;",
"Table NA: ", htmltools::em("Alpha diversity estimates.")),
extensions = "Buttons",
options = list(columnDefs = list(list(className = "dt-center", targets = "_all")),
dom = "Blfrtip", buttons = c("csv", "copy"),
scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE,
lengthMenu = c(10, 20, 30)))
write.table(sample_table, "DATA_TABLES/OUTPUT/Table_X_alpha_diversity.txt",
sep = "\t", row.names = FALSE, quote = FALSE)
```
***
Alpha diversity describes the diversity in a sample or site. There are several alpha diversity metrics available in phyloseq: `Observed`, `Chao1`, `ACE`, `Shannon`, `Simpson`, `InvSimpson`, `Fisher`. Play around to see how different metrics change or confirm these results.
Here we want to know if diversity is significantly different across environments. In order to do that we need to know if we should run a parametric or non-parametric test, and for that we need to know if our data is normally distributed. Most of the ideas/code for alpha (and subsequent beta) diversity statistics come from this [workshop tutorial](https://rpubs.com/maddieSC/R_SOP_UCR_Jan_2018){target="_blank"} by Kim Dill-McFarland and Madison Cox.
First we run the diversity estimates, add these data to our summary table, and save a copy of this table.
### **$\alpha$-diversity statistics**: Test whether the results are normally distributed. Results of this will help guide the analyses we do next. {data-commentary-width=400}
```{r alpha_div_test_norm, warning = FALSE}
# Convert to ps object
sample_div <- sample_data(diversity)
# Create new ps object with diversity estimates added to sample_data
ps_slv_work_filt_div <- merge_phyloseq(ps_slv_work_filt, sample_div)
# Run Shapiro test
shapiro_test_Shan <- shapiro.test(sample_data(ps_slv_work_filt_div)$Shannon)
shapiro_test_invSimp <- shapiro.test(sample_data(ps_slv_work_filt_div)$InvSimpson)
shapiro_test_Chao1 <- shapiro.test(sample_data(ps_slv_work_filt_div)$Chao1)
shapiro_test_Observed <- shapiro.test(sample_data(ps_slv_work_filt_div)$Observed)
```
>Shapiro-Wilk Normality Test for **Shannon** index.
```{r shap_Shan, echo = FALSE}
shapiro_test_Shan
```
>Shapiro-Wilk Normality Test for **inverse Simpson** index.
```{r shap_invS, echo = FALSE}
shapiro_test_invSimp
```
>Shapiro-Wilk Normality Test for **Chao1** richness estimator.
```{r shap_Choa1, echo = FALSE}
shapiro_test_Chao1
```
> Shapiro-Wilk Normality Test for **Observed ASV richness** estimator.
```{r shap_Observed, echo = FALSE}
shapiro_test_Observed
```
***
Next, we add the diversity estimates to our phyloseq object, and test if the data are normally distributed using Shapiro-Wilk Normality test. We will focus on the inverse Simpson and Shannon diversity estimates, the Chao richness estimate, and Observed richness but this approach can be used for any metric.
Ok, since the **p-values are significant** for the inverse Simpson, Chao richness, and Observed ASV richness we **reject the null hypothesis** that these data are normally distributed.
However, the Shannon estimates appear normally distributed. So lets see if diversity is significantly different between environment based on the Shannon index.
### **Assessing significance of $\alpha$-diversity metrics**: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.{data-commentary-width=500}
>Normally distributed metrics
```{r normal}
sampledataDF <- data.frame(sample_data(ps_slv_work_filt_div))
sampledataDF$ENV <- as.factor(sampledataDF$ENV)
sampledataDF$SITE <- as.factor(sampledataDF$SITE)
aov.shannon = aov(Shannon ~ ENV, data = sampledataDF)
#Call for the summary of that ANOVA, which will include P-values
summary(aov.shannon)
```
```{r tukey}
TukeyHSD(aov.shannon)
```
>Non-normally distributed metrics
Kruskal-Wallis of **inverse Simpson** index.
```{r krusk_invsimp}
#library(FSA)
#dunnTest(InvSimpson ~ Sp, data = sampledataDF, method="bh")
kruskal.test(InvSimpson ~ ENV, data = sampledataDF)
```
Pairwise significance test for **inverse Simpson** index.
```{r wilcox_invsimp}
pairwise.wilcox.test(sampledataDF$InvSimpson, sampledataDF$ENV, p.adjust.method="fdr")
```
Kruskal-Wallis of **Chao1 richness** estimator.
```{r krusk_chao}
#dunnTest(Chao1 ~ Sp, data = sampledataDF, method="bh")
kruskal.test(Chao1 ~ ENV, data = sampledataDF)
```
Pairwise significance test for **Chao1 richness** estimator.
```{r wilcox_chao}
pairwise.wilcox.test(sampledataDF$Chao1, sampledataDF$ENV, p.adjust.method="fdr")
```
Kruskal-Wallis of **Observed ASV richness** index.
```{r krusk_observed}
#library(FSA)
#dunnTest(Observed ~ Sp, data = sampledataDF, method="bh")
kruskal.test(Observed ~ ENV, data = sampledataDF)
```
Pairwise significance test for **Observed ASV richness** index.
```{r wilcox_observed, warning = FALSE}
pairwise.wilcox.test(sampledataDF$Observed, sampledataDF$ENV, p.adjust.method="fdr")
```
***
Since the Shannon data is normally distributed we can test for significance using ANOVA (a parametric test).
Ok, the results of the ANOVA are significant. Here we use the Tukey's HSD (honestly significant difference) post-hoc test to determine which pairwise comparisons are different.
Looks like everything is significantly different except for Mat-Sediment communities from Cayo Roldan. Interesting.
Now we can look at the results on the inverse Simpson diversity and Chao's richness. Since Environment is categorical, we use Kruskal-Wallis (non-parametric equivalent of ANOVA) to test for significance.
For the inverse Simpson, Chao1, and Observed richness the results of the Kruskal-Wallis rank sum test are significant. So we can look at pairwise comparisons using Wilcoxon rank sum test for post-hoc analysis.
### **$\alpha$-diversity plots**: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.{data-commentary-width=300}
>Alpha diversity of microbial communities
```{r alpha_div_fig_2B, warning = FALSE}
samp_pal <- c("#CC79A7", "#E69F00", "#009E73", "#56B4E9")
fig2B <- plot_richness(ps_slv_work_filt, x = "ENV",
measures = c("Observed",
"Shannon",
"InvSimpson",
"Chao1"),
color = "ENV", nrow = 1)
fig2B <- fig2B + geom_boxplot() + geom_jitter(width = 0.05)
fig2B <- fig2B + scale_colour_manual(values = samp_pal) +
labs(x = "Environment",
y = "Diversity")
#fig2B + geom_boxplot(aes(colour = black))
#fig2B <- fig2B + theme_bw() + geom_point(size = 2.5, aes(color = ENV)) +
ap <- ggplotly(fig2B, height = 400)
ap %>% layout(margin = list(b = 100))
#fig2B
pdf("FIGURES/OUTPUT/Figure_X_richness.pdf")
fig2B
#invisible(invisible(dev.off()))
```
***
Here we plot the results of each diversity index and include the *Observed* ASV richness. We will save a copy of the figure for later tweaking. We use the color palette described above to delineate environments.
### **$\beta$-diversity analysis**: NMDS coupled with Jensen-Shannon divergence. {data-commentary-width=400}
```{r run_nmds, results = "hide"}
set.seed(3131)
ord.nmds.jsd_slv <- ordinate(ps_slv_work_filt, method = "NMDS", distance = "bray")
```
```{r ord_results}
ord.nmds.jsd_slv
```
***
Beta diversity basically tells us how similar or dissimilar samples are to one another. Phyloseq offers several ordination option using the `method` flag and `distance` metrics. Here we use non metric multidimensional scaling (NMDS) coupled with Jensen-Shannon divergence. We also save a copy of the figure for later tweaking.
We see that a convergent solution was reached around 20 iterations and our stress is below 0.20, meaning that 2-axes are sufficient to view the data. Generally, we are looking for stress values below 0.2. If the stress values are high, you may need to add more axes to the ordination. Lets visualize the plot.
### **$\beta$-diversity NMDS plot **: {data-commentary-width=400}
```{r beta_div_fig_2C}
fig2C <- plot_ordination(ps_slv_work_filt, ord.nmds.jsd_slv,
color = "ENV",
shape = "SITE")
fig2C <- fig2C + geom_point(size = 4.6, stroke = 0.075) +
scale_colour_manual(values = samp_pal)
#fig2C <- fig2C + scale_colour_manual(values = samp_pal)
fig2C <- fig2C + theme(axis.line = element_line(colour = "black"),
panel.background = element_blank(),
panel.grid.major = element_line("grey"),
panel.grid.minor = element_line("grey"),
panel.border = element_rect(colour = "black", fill = NA, size = 1)) +
theme(legend.key = element_rect(colour = "black"))
#fig2C <- fig2C + coord_fixed()
wp <- ggplotly(fig2C, height = 600)
wp %>% layout(margin = list(b = 100))
#fig2C <- fig2C + stat_ellipse(type = "t") + theme_bw()
#plotly::ggplotly(fig2C, tooltip = 'ENV', originalData = TRUE)
#fig2C
pdf("FIGURES/OUTPUT/Figure_X_NMDS.pdf")
fig2C
invisible(dev.off())
```
***
So we can see some clustering within groups and spread between groups, but this is not a test for statistical differences. Do microbial communities differ significantly by environment?
### **$\beta$-diversity statistics **: {data-commentary-width=400}
First we use the `adonis` function in vegan to run a PERMANOVA test. This will tell us whether environments have similar centroids or not.
```{r ordination_stats_adonis}
set.seed(1911)
fish.jsd <- phyloseq::distance(ps_slv_work_filt, method = "jsd")
sampledf <- data.frame(sample_data(ps_slv_work_filt))
fish_adonis <- adonis(fish.jsd ~ ENV, data = sampledf, permutations = 1000)
fish_adonis
```
These results indicate that centroids are significantly different across environments meaning that communities are different by environments.
We can also use the `pairwiseAdonis` package for pair-wise PERMANOVA analysis.
```{r pairwise_adonis}
pairwise.adonis(fish.jsd, factors = sampledf$ENV, p.adjust.m = "bonferroni")
```
Here we see again we see that communities are different by environments.
However, PERMANOVA assumes equal beta dispersion so we will use the `betadisper` function from the `vegan` package to calculate beta dispersion values.
```{r betadisper}
beta_adonis <- betadisper(fish.jsd, sampledf$ENV, bias.adjust = TRUE)
beta_adonis
```
And then a pair-wise Permutation test for homogeneity of multivariate dispersions using `permutest` (again from the `vegan` package).
```{r permutest}
permutest(beta_adonis, pairwise=TRUE, permutations=1000)
```
These results are significant, meaning that environments have different dispersions. Looking at the pairwise p-values and permuted p-value, we see that the significant differences (p-value < 0.05) are between most environments.
This means we are less confident that the PERMANOVA result is a real result, and that the result is possibly due to differences in group dispersions.
We can also use Analysis of Similarity (ANOSIM)---which does not assume equal group variances---to test whether overall microbial communities differ by environment.
```{r ordination_stats_anosim}
spgroup <- get_variable(ps_slv_work_filt, "ENV")
fish_anosim <- anosim(distance(ps_slv_work_filt, "jsd"), grouping = spgroup)
summary(fish_anosim)
```
And the AN0SIM result is significant meaning that environment influences microbial community composition.
```{r simper, eval = FALSE, echo = FALSE, include = FALSE}
source("/Users/j/Desktop/FISH_MICROBIOME/02_PHYLOSEQ_WORKFLOW/HELPER_SCRIPTS/simper_pretty.R")
#Using the function
otutab <- as.table(otu_table(ps_slv_work_filt))
stuff <- simper.pretty(otu_table(ps_slv_work_filt), sample_data(ps_slv_work_filt), "Sp", perc_cutoff=0.5, low_cutoff = 'y', low_val=0.01, 'name')
```
***
To test whether microbial communities differ by environment we can use permutational analysis of variance (PERMANOVA) or analysis of similarity (ANOSIM). PERMANOVA does not assume normality but does assume equal beta dispersion between groups. We will test beta dispersion below.
Water {.storyboard}
=========================================
### **Site differences**: Here we look at differences between the two sites. For {data-commentary-width=400}
```{r, echo=FALSE}
# Working directory
# Load SV table
water_seq_table <- read.csv("DATA_TABLES/INPUT/water_seq_table.csv",
stringsAsFactors = FALSE, row.names = 1)
# SV table
water_seq_table <- as.data.frame(t(water_seq_table))
# Delete columns when sum == 0
water_seq_table <- water_seq_table[, which(colSums(water_seq_table) != 0)]
# Add First column with name of each group
water_seq_table <- cbind(DO =
c("normoxic","normoxic","normoxic","normoxic","normoxic"
,"hypoxic","hypoxic","hypoxic"), water_seq_table)
library(labdsv)
iva <- indval(water_seq_table[,-1], water_seq_table[,1])
gr <- iva$maxcls[iva$pval<=0.05]
iv <- iva$indcls[iva$pval<=0.05]
pv <- iva$pval[iva$pval<=0.05]
fr <- apply(water_seq_table[,-1]>0, 2, sum)[iva$pval<=0.05]
indvalsummary <- data.frame(group=gr, indval=iv, pvalue=pv, freq=fr)
indvalsummary <- indvalsummary[order(indvalsummary$group, -indvalsummary$indval),]
write.csv(indvalsummary, "DATA_TABLES/OUTPUT/indvalsummary.csv")
# Load Tax table
water_tax_table <- read.csv("DATA_TABLES/INPUT/water_tax_table.csv",
stringsAsFactors = FALSE, row.names = 1) # tax table
# Merge indvalsummary with tax names
indvalsummary_tax <- merge(indvalsummary,
water_tax_table, by="row.names",
all=TRUE)
# Drop rows with NAs
indvalsummary_tax <- indvalsummary_tax[!(is.na(indvalsummary_tax$group)),]
# Rename leves of group factor
# indvalsummary_tax <- as.character(indvalsummary_tax$group)
```
```{r, include = FALSE}
lapply(indvalsummary_tax, class)
class(indvalsummary_tax$group) = "character"
indvalsummary_tax$group <- ifelse(indvalsummary_tax$group == "1",
as.character("hypoxic"),
as.character("normoxic"))
```
```{r}
datatable(indvalsummary_tax,
rownames = FALSE,
width = "100%",
caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;",
"Table NA: ", htmltools::em("DESCRIPTION.")),
extensions = "Buttons",
options = list(columnDefs = list(list(className = "dt-center", targets = "_all")),
dom = "Blfrtip", buttons = c("csv", "copy"),
scrollX = TRUE, scrollCollapse = TRUE, deferRender=TRUE, scrollY=300, scroller=TRUE,
lengthMenu = c(10, 50, 160)))
write.csv(indvalsummary_tax,
"DATA_TABLES/OUTPUT/Hypoxia_indvalsummaryWATER_tax.csv",
row.names = F)
```
***
```{r}
asv_list <- as.vector(indvalsummary_tax$Row.names)
# Create ps objects
da_asvs <- prune_taxa(asv_list, ps_slv_water)
da_asvs_full <- prune_taxa(asv_list, ps_slv_work_filt)
# Create fasta file from tax_table
table2format <- tax_table(da_asvs)
#retain only the column with the sequences
table2format_trim <- table2format[, 7]
table2format_trim_df <- data.frame(row.names(table2format_trim), table2format_trim)
colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
table2format_trim_df$ASV_ID <- sub("ASV" , ">ASV", table2format_trim_df$ASV_ID) #format fasta
write.table(table2format_trim_df, "DATA_TABLES/OUTPUT/da_asv.fasta", sep = "\r",
col.names = FALSE, row.names = FALSE, quote = FALSE, fileEncoding = "UTF-8")
```